The benefits of item response theory (IRT) over classical test
theory have been espoused widely by many test specialists (e.g.,
Hambleton, 1989; Lord, 1980). These test specialists assert that IRT
offers test developers increased measurement precision, and so tests 
developed using IRT provide accurate assessment of examinee ability (or 
proficiency) using fewer items than tests developed using classical 
procedures. 

The efficient precision of measurement provided by IRT is 
accomplished by placing person and item parameters on the same 
measurement scale (i.e., item and person parameters are scaled in the 
metric of the underlying latent trait). Because the person and item 
parameters are on the same measurement scale, they are sample- 
independent. That is, the person parameters (ability estimates) are 
independent of the particular sample of items administered, and the item 
parameters (difficulty, discrimination, guessability) are independent of the 
particular sample of examinees tested. This feature of IRT allows for 
direct equating of tests assembled from a common pool of items, and 
provides an unambiguous means for combining information provided by 
different item types onto a common scale. 

Though IRT offers many benefits to test developers, it has one clear 
limitation: relatively large numbers of examinees (sample sizes) must be 
tested to provide accurate results. This limitation is unfortunate because 
many tests are administered to, and developed from, relatively small 
numbers of examinees. For this reason, most applications of IRT in test 
use and development are found in large-scale testing organizations. 

Previous research on IRT with small samples has concluded that 
sample sizes under 200 are not appropriate for even the simplest (i.e., 
least general) IRT models (e.g., one-parameter logistic model) and that 
much larger samples are required for the more complex (e.g., two- and 
three-parameter) models (cf. Hulin, Lissak, & Drasgow, 1982; Lord,
1968; Ree & Jensen, 1980; Thissen & Wainer, 1982; Wright & Stone, 
1979). However, some recent research investigating modifications of 
these traditional IRT models has indicated that modified IRT models may 
be appropriate for use in some small-scale testing applications (Barnes & 
Wise, 1991; Sireci, 1991). 

This study investigated the utility of modified IRT models in a
small-sample testing application. The modified IRT models used were
modifications of the one- and two-parameter logistic models. The purpose
of this investigation was to determine whether these modified models
would be appropriate in small-sample testing applications.

The test data analyzed in this study were part of a national 
certification examination for persons desiring certification in personal 
financial planning. The data represented four separate administrations of 
the examination over a four-year period. Because the requirements to sit 
for the examination were fairly stringent, only about 150 persons sat for 
the examination each year. The number of examinees (sample sizes) 
who sat for the examination each year was 173, 149, 106, and 159, for 
years 1 through 4, respectively. The examination consisted of 100
multiple-choice items, and separate test forms were administered each
year. The test forms were constructed to be parallel and were equated
using a common-item (nonequivalent groups) linear equating procedure
(Angoff, 1984; Kolen & Brennan, 1987). There were 13 items in common
among the four test forms. The data for these 13 items were aggregated
over the four-year period so that comparisons could be made between the
small-sample data (i.e., the data from a single test administration) and the 
aggregate data (i.e., the data combined for the 13 items over the four-year 
period). 

Item Parameter Stability

The first part of the investigation evaluated the stability of the item 
parameters over the four-year period. Item parameter stability was 
evaluated by using restricted and unrestricted IRT models and comparing 
their fit to the data. The unrestricted IRT models computed the item
parameters for each group separately, whereas the restricted models
constrained the item parameters to be equal among the four groups.
Thus, the restricted models represented item parameter stability (i.e.,
the item parameters were equal from sample to sample), and the
unrestricted models represented item parameter instability (i.e., the
item parameters were not equal across samples).²



²Restricted IRT models have been used previously in a variety of research contexts. For example,
Stone and Lane (1991) used restricted IRT models to investigate item parameter stability
over time; Thissen, Steinberg, and Gerrard (1986) and Thissen, Steinberg, and Wainer
(1988, in press) used restricted IRT models to investigate differential item functioning; and
Wainer, Sireci, and Thissen (1991) used restricted IRT models to investigate differential
testlet functioning.
The purpose of the analysis of item parameter stability was to
determine whether an IRT model could be directly applied to a
small-sample data set. If item parameter stability was exhibited over the four
groups, then the item parameters would be appropriate for estimating
examinee proficiency. Using restricted IRT models, Sireci (1991) found
that item parameter stability did not hold over three separate small-sample
test administrations. However, the IRT models in the Sireci (1991) study
did not include a fixed lower asymptote, which Barnes and Wise (1991)
suggested for use with small data sets.

Mixed IRT Models 

The second part of the present investigation evaluated the utility of 
"mixed" IRT models for small data sets. Mixed IRT models use more 
than one IRT model in a single analysis. Using a mixed IRT model, some
test items could be modeled using a 1PL, while other items could be
modeled using a 2PL, and so on. Thissen (1991) demonstrated how mixed
IRT models can be used to include different item types (e.g., multiple- 
choice items and categorical items) in a single analysis run. The purpose 
of using mixed IRT models in the present study was to demonstrate how 
incorporation of prior information (i.e., incorporation of item parameters 
based on an aggregated data set) can increase the precision of IRT 
estimates based on small samples. 

The One-, Two-, and Three-Parameter IRT Models

The three IRT models used in this study were the one-, two-, and
three-parameter logistic models (1PL, 2PL, and 3PL). There are several
thorough descriptions of these and other IRT models available in the
literature (e.g., Hambleton, 1989; Lord & Novick, 1968; Thissen &
Steinberg, 1986), and so they are not described in detail here. The
equations for the 1PL, 2PL, and 3PL, respectively, are presented below:



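$$P(\theta) = \frac{1}{1 + \exp[-a(\theta - b)]} \tag{1}$$

$$P(\theta) = \frac{1}{1 + \exp[-a(\theta - b)]} \tag{2}$$

$$P(\theta) = c + \frac{1 - c}{1 + \exp[-a(\theta - b)]} \tag{3}$$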
where P(θ) is the probability of choosing the correct answer as a
function of θ; b is the difficulty level of the item, a is proportional to
the slope of the item characteristic curve (ICC) at the point θ = b, and c
is the lower asymptote of the ICC. The item parameters a, b, and c are commonly
referred to as the discrimination, difficulty, and lower-asymptote (or
guessing) parameters, respectively. [a is fixed in the 1PL and indicates a
constant value of discrimination.] The restricted IRT models used in this
study involved constraining the a, b, and/or c parameters to be equal for
identical items taken by examinees in one of the four different groups.
The modified IRT models used in this study involved fixing one or more of
these parameters to be equal to some pre-specified value.
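
As a minimal sketch of the three response functions, the 1PL and 2PL can be treated as special cases of the 3PL: setting c = 0 gives the 2PL, and additionally holding a at one common value for every item gives the 1PL. The function name `icc` and the example item values below are illustrative assumptions and are not part of MULTILOG.

```python
import math

def icc(theta, a, b, c=0.0):
    """Probability of a correct response under the 3PL logistic model.

    With c = 0 this is the 2PL form; with c = 0 and one common `a`
    for all items it is the 1PL form. No scaling constant (such as
    1.7) is applied, matching the equations in the text.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item with a = 1.0 and b = 0.0, evaluated at theta = b.
print(icc(0.0, a=1.0, b=0.0))          # no lower asymptote
print(icc(0.0, a=1.0, b=0.0, c=0.20))  # fixed lower asymptote of .20
```

Fixing a parameter, as in the modified models, corresponds here to passing a constant for that argument rather than estimating it.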

Comparing Model Fit: -2loglks and χ²

All IRT analyses reported here were conducted using the 
MULTILOG (version 6.0) IRT software program (Thissen, 1991). 
MULTILOG is a very general program that fits a variety of IRT models 
to test data using the marginal maximum likelihood method (Bock &
Aitkin, 1981). Because MULTILOG uses a maximum likelihood procedure,
"negative twice the log likelihood" values (-2loglks) are provided for each
analysis. Because the difference between the -2loglks of two competing
hierarchical IRT models is distributed as chi-square, this difference
can be evaluated for statistical significance by computing the probability of
obtaining the observed difference by chance (with degrees of freedom
equal to the difference between the number of free parameters estimated
in each model). If the additional parameters in the more general
(unrestricted) model add substantially to the data-model fit, then the
difference between the -2loglks will be significant. However, if the
difference is not significant, then the more parsimonious (i.e., restricted)
model is preferred. This chi-square difference test is appropriate only for
comparing hierarchical models (i.e., the more general model estimates all
of the parameters of the restricted model, plus some additional ones).
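
The chi-square difference test can be sketched in a few lines. The example below is an illustration only: the helper names `chi2_sf` and `lr_test` are assumptions, and the upper-tail probability is computed with the standard library rather than with MULTILOG. The inputs are the rounded -2loglk values and parameter counts reported for the 2PL and 3PL on the aggregated data, so the printed probability should fall near, but not exactly on, the tabled value.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df.

    Uses the recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1)
    for the regularized upper incomplete gamma function, with h = x / 2.
    """
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        q, a = math.exp(-h), 1.0              # Q(1, h)
    else:
        q, a = math.erfc(math.sqrt(h)), 0.5   # Q(1/2, h)
    while a < df / 2.0:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q

def lr_test(neg2ll_restricted, k_restricted, neg2ll_general, k_general):
    """Chi-square difference test for two hierarchical (nested) models."""
    diff = neg2ll_restricted - neg2ll_general
    df = k_general - k_restricted
    return diff, df, chi2_sf(diff, df)

# Reported values for the aggregated data: 2PL (restricted) vs. 3PL (general).
diff, df, p = lr_test(1587, 26, 1578, 39)
print(diff, df, round(p, 2))
```

A large probability favors the more parsimonious model; a small one indicates that the additional parameters improve fit.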

Procedure and Results 

Assessing dimensionality. To determine whether the test items 
were appropriate for IRT analysis, an inter-item tetrachoric correlation 
matrix was computed for the 13-item data set based on the aggregated 
data of 587 (173+149+106+159) examinees. A one-dimensional (factor) 
model was fit to this inter-item correlation matrix using LISREL-7
(Jöreskog & Sörbom, 1988). This preliminary analysis was conducted to
determine whether the unidimensionality assumption of IRT was satisfied.
The one-factor model accounted for over 66% of the variance in the data
and exhibited low values of residual error (coefficient of determination
was .66, RMSE=.06, and the standard errors for the items ranged from
.049 to .051). Although the assessment of test dimensionality using item
intercorrelations is controversial (cf. Gorsuch, 1983; Green, 1983), these
results were considered to be indicative of unidimensionality, and so the
IRT models were deemed appropriate.

Determining model fit for the aggregated data. The 1PL, 2PL, and
3PL models were fit to the aggregated data set to determine the most
appropriate model for these data. Priors for the lower asymptotes (c
parameters) were set at .25, which was the reciprocal of the number of
response alternatives. The results of these analyses are presented in
Table 1. The significance tests of the differences between the -2loglks of
the three models indicated that the 2PL was the appropriate model for
these data. This improvement in fit of the 2PL over the 1PL is consistent
with a preliminary analysis of the data that indicated moderate variation
among the item biserials (thus undermining the constant discrimination
assumption of the 1PL). The lack of improvement in fit for the 3PL may
represent the absence of a guessing factor among the less-proficient
examinees, or it may result from an inability to compute accurate
lower asymptotes because of the relatively small sample size (Thissen &
Wainer, 1982).

Table 1

Results of 1PL, 2PL, and 3PL Analyses on Aggregated Data (N=587)

                    # Free      Difference
Model   -2loglk   Parameters       χ²        df      p

3PL      1578        39
2PL      1587        26             9        13     .78
1PL      1627        14            49        25     .002



Because of the reported difficulty in estimating lower asymptotes
(cj's) from relatively small data sets, and because Barnes and Wise (1991)
recommended incorporating a fixed value for the asymptotes into a
one-parameter model, modified 1PL (MOD-1PL) and modified 2PL
(MOD-2PL) analyses were conducted. These modified models added a fixed
constant lower asymptote to the 1PL and 2PL. The cj's for both modified
models were fixed at .20, which was the reciprocal of the number of
response alternatives minus .05. This value was used based on previous
research by Barnes and Wise (1991) and Divgi (1984). The results for
the MOD-1PL and MOD-2PL analyses are presented in Table 2. The
MOD-1PL analysis resulted in a smaller -2loglk than the 1PL; however,
this log likelihood was significantly different from that obtained by the
MOD-2PL. Therefore, it appears that the assumption of equal slopes
(discrimination) among the items is not appropriate for these data. The
-2loglk for the MOD-2PL was identical to the value obtained in the 2PL
analysis, and so it is unclear whether the addition of the lower asymptote
improves the performance of the 2PL model.



Table 2

Results of MOD-1PL and MOD-2PL Analyses on Aggregated Data
("MOD" indicates the inclusion of fixed, non-zero cj's)
(N=587)

                       # Free      Difference
Model      -2loglk   Parameters       χ²        df      p

MOD-2PL     1587        26             9        13     .78
MOD-1PL     1627        14            49        25     .002



Determining item parameter stability. To determine whether any of
the IRT models could be applied directly to a small-sample data set (i.e.,
the data from one of the four groups), the stability of the item parameters
across the four groups (samples) was investigated.³ The results of these
analyses are reported in Table 3. [The input command file used to fit the
MOD-2PL model is presented in Appendix A to illustrate how the
constraints were imposed via MULTILOG.] A comparison of the
restricted (item parameter stability) and unrestricted (instability) -2loglks
indicated that item parameter stability was not exhibited for the 1PL, 2PL,
or 3PL models. Therefore, direct application of these IRT models to any
one of the samples would not be appropriate.



³Differences in proficiency between the examinee samples were not expected to affect the
results of this analysis. The differences among the mean theta values for the groups were not
statistically significant. Furthermore, Stone and Lane (1991) reported that item parameter
stability held over time for groups that differed in proficiency.
Aside from the traditional 1PL, 2PL, and 3PL models, the stability of
the item parameters resulting from four other IRT models was
investigated (also reported in Table 3). The first two models represent
restricted 2PL models that were investigated by Sireci (1991). The model
"aj's only" constrained only the slopes of the 2PL to be equal
among the four samples. The model "bj's only" restricted only the
location (difficulty) parameters to be equal among the four groups.
Though these two models exhibited better fit than the fully-restricted 2PL,
neither the aj's nor bj's exhibited satisfactory stability. The other two
models investigated were the modified 1PL (MOD-1PL) and 2PL
(MOD-2PL) models that incorporated a fixed value of .20 for the lower
asymptote parameters. These models failed to exhibit stability over
the four small-sample data sets, and so it was concluded that none of the
IRT models studied were appropriate for these small-sample data.



Table 3

Results of Item Parameter Stability Analyses

                              # Free      Difference
Model             -2loglk   Parameters       χ²        df      p

3PL
  Unrestricted     1302        159
  Restricted       1548         42          246        117   <.001

2PL
  Unrestricted     1305        107
  Restricted       1557         29          252         78   <.001
  aj's only        1364         68           59         39    .018
  bj's only        1372         68           67         39    .003

1PL
  Unrestricted     1409         56
  Restricted       1610         17          201         39   <.001

MOD-2PL
  Unrestricted     1306        107
  Restricted       1548         29          242         78   <.001

MOD-1PL
  Unrestricted     1397         56
  Restricted       1560         17          163         39   <.001
The Utility of Mixed IRT Models and Aggregated Data

The preceding analyses indicated that the IRT models used were
not appropriate for the small-sample data. However, it is possible that the
item parameters obtained from the aggregated data analyses are
appropriate (i.e., stable). Though there is no way to determine the
stability of the parameters estimated from the aggregated data (aside from
waiting several years to cross-validate on a new aggregated sample), we
can investigate whether the item parameters obtained from the
aggregated data are beneficial in calibration of parameters in a single
(small) sample run. The purpose of this section is to determine whether
such aggregated data can be beneficial to the small-sample test
practitioner.

If common items exist across several small-sample administrations 
of a test (as was the case with the present study), then the data on these
common items could be aggregated over administrations. The item
parameters obtained from analysis of the aggregated data are likely to be
more stable than those based on the small-sample administrations. If
appropriate, these item parameters could then be used for item selection,
scaling, equating, and scoring of subsequent test forms.



IRT item parameters based on aggregated data. To investigate the 
utility of using aggregated data, some item parameters from the 2PL 
analysis based on the aggregated data were selected for inclusion in a 
mixed-model analysis on the data for a single administration (Group 4, 
n=159). The item parameters that resulted from the 2PL analysis on the 
aggregated data set are presented along with the content area
specification for each item (for the five content areas measured by this
test) in Table 4. Although a few parameters have high standard errors,
these standard errors are very small in relation to the standard errors
observed in the unrestricted models reported above (i.e., based on
separate calibrations for each sample). The data in Table 4 represent
typical data that can be computed readily by the small-sample test
practitioner who has several items in common over separate
administrations of an examination. Because many small-sample test
forms are equated using common-item equating procedures, it is likely 
that many of these practitioners could easily create such aggregated data 
sets. The five items that were selected for the mixed-model IRT analysis 
(MIX) are highlighted in Table 4. 
Table 4

Item Parameters from 2PL Analyses on Aggregated Data

        Content
Item    Area       aj    (s.e.)      bj    (s.e.)

  1      A        .33    (.11)      1.72   (.66)
  2      A        .10    (.10)     -2.94   (3.14)
  3*     A        .82    (.15)     -1.61   (.30)
  4      B        .68    (.15)     -2.58   (.53)
  5*     B        .24    (.11)       .66   (.55)
  6      C        .98    (.15)      -.53   (.12)
  7*     C       1.22    (.17)      -.93   (.12)
  8      C        .51    (.15)     -3.67   (.99)
  9      C        .53    (.13)     -1.97   (.49)
 10*     D        .60    (.12)      -.15   (.18)
 11      D        .41    (.11)     -1.15   (.39)
 12      D        .56    (.12)      -.90   (.26)
 13*     E        .62    (.14)     -2.24   (.47)

Note: Values are scaled to Mu=0.0 and SD=1.0

Items marked with an asterisk (boldface in the original) indicate items selected for mixed analysis reported below.



In selecting items to be used on future test forms, both statistical
and content criteria must be satisfied. Therefore, a resourceful test
developer would most likely select items within each content area that
fulfill the content specifications of the test and demonstrate satisfactory
statistical criteria. Such statistical criteria would include satisfactory
difficulty and discrimination values. Furthermore, in testing situations
where cut-off scores are used, such as in licensure or certification testing,
the test developer would also want to select items that maximize
discrimination (test information) at the cut-score.

Given such considerations, the test developer could select items 
based on the aggregated IRT parameter estimates and incorporate these 
estimates into a mixed model analysis. Items could be selected that 
maximize the information around the cut-score (given necessary content 
constraints). The parameters for the re-used (common) items could be 
fixed at their values obtained from the aggregated data, while the new
items on a test form could be fit using a parsimonious IRT model such as
the 1PL or MOD-1PL. Using this procedure, relatively few item
parameters would need to be estimated and the need for large amounts of
data would be obviated.

To test whether this procedure would result in increased test
information for these data, one item from each of the five content areas
was selected based upon the 2PL difficulty and discrimination parameters
and their standard errors (the selected item numbers are printed in
boldface in Table 4). A mixed-model IRT analysis (MIX) was
performed on the data from one of the samples (Group 4, n=159), and
1PL and MOD-1PL analyses were applied to the same data. An
additional model, MOD-MIX, was also applied. This model added a fixed
lower asymptote (at .20) to the MIX model. The input command file for
the MOD-MIX analysis is reproduced in Appendix B. The commands
in this input file illustrate how to impose the necessary equality constraints
among the new (1PL) items, and how to fix the cj parameters for all 13
items, and the aj and bj parameters for the 5 common items.

To evaluate the relative contribution of the prior information (i.e.,
the 2PL item parameters estimated from the aggregated data), item and
test characteristic curves were computed for four IRT models that were
fit to the data. The test information curves were computed for each of the
models to determine whether increased test information was obtained by
using the parameters based on the aggregated data.

Test information. Test information curves (TIC) depict the
reciprocal of the squared standard error at each point along the ability scale.
Thus, larger amounts of information indicate smaller amounts of
measurement error. The preferred shape of a TIC varies according to
the purpose of the test (Lord, 1977; Hambleton, 1989; Thissen, 1990). For
tests that are designed to discriminate between examinees along the
entire continuum of proficiency (theta), platykurtic (flat) curves are
preferable. For tests that use cut-off scores, leptokurtic curves are
preferred that peak (maximize information) at the level of theta that
corresponds to the cut-score (and so skewness would be determined by
the location of the cut-score). Regardless of the shape desired, tests that
generate TICs that have larger upper asymptotes are preferable to those
with lower upper asymptotes.
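
To make the link between information and measurement error concrete, the sketch below computes item and test information under the 3PL and converts test information to a standard error through SE(θ) = 1/√I(θ). It is an illustration under stated assumptions: the function names are hypothetical, the information formula is the standard 3PL expression in the logistic metric (no 1.7 scaling constant), and the five items use the aj and bj values fixed in Appendix B together with the fixed cj of .20. It does not reproduce the curves in the figures, which are based on all 13 items.

```python
import math

def icc(theta, a, b, c=0.0):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, c=0.0):
    """3PL item information; reduces to a**2 * P * (1 - P) when c = 0."""
    p = icc(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def total_information(theta, items):
    """Test information is the sum of the item information values."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)

# (a, b, c) for the five common items whose parameters are fixed in Appendix B.
FIXED_ITEMS = [(0.82, -1.61, 0.20), (0.24, 0.66, 0.20), (1.22, -0.93, 0.20),
               (0.60, -0.15, 0.20), (0.62, -2.24, 0.20)]

for theta in (-3, -2, -1, 0, 1, 2, 3):
    info = total_information(theta, FIXED_ITEMS)
    se = 1.0 / math.sqrt(info)
    print(f"theta={theta:+d}  information={info:.3f}  SE={se:.3f}")
```

Evaluating `total_information` at the theta value of a cut-score is one way to compare candidate item sets for a test that uses a cut-off.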

Figure 1 presents the test information curve (TIC) resulting from a
1PL analysis of the Group 4 data. Figure 2 presents the TIC for the
MOD-1PL analysis for these same data. A comparison of Figures 1 and
2 reveals that inclusion of the fixed cj's increased test information along
the theta (θ) range -1 to +3. However, the 1PL exhibited greater
information at the lower end of the θ-scale.

Figure 1: Test Information Curve for 1PL (Group 4 Data)

Figure 2: Test Information Curve for MOD-1PL (includes fixed cj's)
Figure 3 presents the TIC for the MIX analysis. The incorporation
of the fixed parameter values from the aggregated 2PL analysis did not
increase test information at any point along the θ-scale. However, when
the fixed cj's were incorporated into the model (MOD-MIX), the shape
of the TIC changed dramatically. The TIC for the MOD-MIX model is
illustrated in Figure 4. The incorporation of the fixed cj's increased the
test information substantially over the θ-scale range -2 to +.5, and
appears to peak at θ = -1. This value of theta is equivalent to one standard
deviation unit below the population mean and is a common cut-off score
used by many licensing and certification programs. Thus, the TIC
produced by the MOD-MIX analysis may be useful in these application
areas.



Figure 3: Test Information Curve for MIX Model (Group 4 data)

Figure 4: Test Information Curve for MOD-MIX (includes fixed cj's)


RMSE. Though the test information curves are informative
regarding measurement precision, the standard errors associated with the
score of each examinee provide another means for evaluating the
precision of IRT-based scores. To evaluate the precision of the
MOD-MIX model, root mean square residual errors (RMSE) were calculated
for the Group 4 examinees using both the 1PL and MOD-MIX models.
The RMSE index is used widely in simulation studies to estimate the
degree of departure of IRT estimates from their known parameters (e.g.,
Barnes & Wise, 1991; Thissen, 1990). Though the "true" proficiency
estimates of the examinees in the study were not known, their proficiency
estimates provided by the aggregate analysis can serve as a reference for
the proficiency estimates computed from the single-sample runs. Thus,
the RMSE were computed by taking the difference between each
examinee's ability estimate ("theta-hat") from the MOD-2PL analysis
(using the aggregated data) and an alternative model (either MOD-MIX
or 1PL), and then taking the square root of the average of these squared
differences.
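
The computation described above can be sketched as follows. The function name `rmse` and the theta-hat values are hypothetical and are shown only to illustrate the arithmetic; they are not data from the study.

```python
import math

def rmse(reference, alternative):
    """Root mean square difference between two sets of theta estimates."""
    if len(reference) != len(alternative) or not reference:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(r - x) ** 2 for r, x in zip(reference, alternative)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical theta-hats for four examinees: the reference values stand in
# for the aggregated-data estimates, the others for a single-sample model.
theta_reference = [-1.20, -0.35, 0.10, 0.85]
theta_alternative = [-1.10, -0.40, 0.25, 0.80]
print(round(rmse(theta_reference, theta_alternative), 4))
```

Because the reference estimates are themselves estimates rather than known values, the result measures agreement between two calibrations, not recovery of true proficiency.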

Table 5 presents the mean, standard deviation, and range of the
standard errors of the theta estimates for the 159 examinees in Group 4,
for each of the three models of interest. The RMSE for the MOD-MIX
and 1PL models (using the MOD-2PL as the "true" model) are also
provided. The MOD-MIX model exhibited the smallest average standard
error, although it also exhibited a higher RMSE than did the 1PL model.
The differences observed between the standard errors and RMSE are
difficult to interpret, particularly in light of the relatively small variation
among the estimates for the 1PL. Further analyses were planned to
replicate the RMSE analyses with the other three samples, but
unfortunately, these analyses could not be completed in time for this
presentation. However, it is likely that the RMSE analyses may not be
appropriate because the true theta values are not known.



Table 5

RMSE and Standard Errors of 1PL and MOD-MIX Models

Model      Avg. S.E.   St. Dev.    Range        RMSE

MOD-2PL       .69        .03      .658-.796
MOD-MIX       .66        .05      .597-.783     .0471
1PL           .69        .02      .657-.755     .0261



Discussion 

This investigation has, first, demonstrated that restricted modeling
can be used to investigate item parameter stability over small samples of
real test data, and second, investigated a means by which some
small-sample test practitioners may benefit from IRT methodology. Though the
problems with using IRT under small-sample conditions have been noted
since the early days of IRT (Lord, 1968), little research has been done to
redress this problem. Perhaps the bottom line is that IRT cannot be used
with sample sizes smaller than 200 examinees, no matter how much we
incorporate prior information and/or fiddle with the parameter estimation
procedure. The results of this study neither reject nor support such a
statement, and so it is clear that future research is needed in this area.
Though the challenge is great, the effort will be justified if IRT can be
brought into the hands of small-sample test practitioners.

One avenue for future research may be to increase the number of
items for which aggregated data are available and include them in a
calibration run for an actual examination. In this study, only five items
incorporated prior information (aside from the fixed lower asymptotes on
all items in the MOD models), and they were analyzed together with only
eight other items. The contribution of prior information to longer test
lengths, and the inclusion of prior information on more of the test items,
are likely to improve test information. Future research should also focus
on selecting items that will maximize a target test information curve.

A rather atypical feature of this study was that fairly complex IRT
models were applied to these data, yet relatively small numbers of
parameters were estimated. For example, in the MOD-MIX model,
essentially a 3PL model was fit to the data, yet only 8 parameters were
estimated for the 13 items! This reduction in the number of parameters to
be estimated stems from the fixing of the cj's for all items, and the fixing
of the aj's and bj's for the five "common" items. Because the
data/parameter ratio is the keystone for robust parameter estimation, any
promise for the use of IRT with small data sets must concentrate on
increasing that ratio. Though fixing item parameters reduces the number
of parameters to be estimated by the model, it invokes the critical question
"How defensible are the parameter values that are fixed in these runs?"
The research of Divgi (1984) and Barnes and Wise (1991) suggests that fixed
cj's are defensible; however, their findings must be replicated with real
test data. Though this study offers promise for IRT application in
small-sample settings, the stability of the item parameters gathered from test
data aggregated over several small-sample administrations requires
further investigation.
Appendix A

MULTILOG INPUT FOR MOD-2PL MODEL

(i.e., restricted (stability) model with fixed lower asymptotes):

>PRO RA IN NI=52 NG=4 NE=587; 

>TEST ALL L3; 

>EQUAL AJ IT=(14(1)26) WI=(1(1)13);
>EQUAL AJ IT=(27(1)39) WI=(14(1)26);
>EQUAL AJ IT=(40(1)52) WI=(27(1)39);
>EQUAL BJ IT=(14(1)26) WI=(1(1)13);
>EQUAL BJ IT=(27(1)39) WI=(14(1)26);
>EQUAL BJ IT=(40(1)52) WI=(27(1)39);

>FIX ALL CJ VA=.20; 

>END; 
Appendix B

MULTILOG INPUT FOR MOD-MIX MODEL: (Fixed parameters for
items 3, 5, 7, 10, & 13 based on aggregated 2PL; fixed lower asymptotes).

>PRO RA IN NI=13 NG=1 NE=159;
>TEST ALL L3;

>EQUAL AJ IT=(8,9,11,12) WI=(1,2,4,6);
>EQUAL AJ IT=2 WI=1;
>EQUAL AJ IT=4 WI=1;
>EQUAL AJ IT=6 WI=4;
>EQUAL AJ IT=9 WI=8;
>EQUAL AJ IT=11 WI=9;
>EQUAL AJ IT=12 WI=11;

>FIX ALL CJ VA=.20;

>FIX IT=3 AJ VA=.82;
>FIX IT=3 BJ VA=-1.61;
>FIX IT=5 AJ VA=.24;
>FIX IT=5 BJ VA=.66;
>FIX IT=7 AJ VA=1.22;
>FIX IT=7 BJ VA=-.93;
>FIX IT=10 AJ VA=.60;
>FIX IT=10 BJ VA=-.15;
>FIX IT=13 AJ VA=.62;
>FIX IT=13 BJ VA=-2.24;

>END; 